Chapter 8 Constructions and Idioms
library(tidyverse)
library(tidytext)
library(quanteda)
library(stringr)
library(jiebaR)
library(readtext)
8.1 Collostruction
In this chapter, I would like to talk about the relationship between a construction and words. Words may co-occur to form collocation patterns; when words co-occur with a particular morphosyntactic pattern, they form collostruction patterns.
Here I would like to introduce a widely applied method for research on the meanings of constructional schemas: Collostructional Analysis (Stefanowitsch and Gries 2003). This is the major framework in corpus linguistics for the study of the relationship between words and constructions.
The idea behind collostructional analysis is simple: the meaning of a morphosyntactic construction can often be determined by its co-occurring words.
In particular, words that are strongly associated (i.e., co-occurring) with the construction are referred to as collexemes of the construction.
Collostructional Analysis is an umbrella term that covers several sub-analyses for constructional semantics:
- collexeme analysis
- co-varying collexeme analysis
- distinctive collexeme analysis
This chapter will focus on the first one, collexeme analysis, whose principles can be extended to the other analyses.
Also, I will demonstrate how we can conduct a collexeme analysis by using the R script written by Stefan Gries (Collostructional Analysis).
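Before turning to the script itself, the statistical core of collexeme analysis can be sketched with a single 2x2 contingency table: for each word, we cross-tabulate its occurrences inside and outside the construction against those of all other words, and compute an association measure such as the Fisher-Yates exact test (the default measure in Gries' script). The word frequencies below are invented for illustration; only the corpus size and construction size come from this chapter's data.

```r
# Hypothetical frequencies, for illustration only (not real corpus counts):
word_in_cxn <- 40        # the word occurs 40 times inside the construction
cxn_total   <- 546       # the construction occurs 546 times overall
word_total  <- 120       # the word occurs 120 times in the whole corpus
corpus_size <- 3209617   # total number of words in the corpus

# 2x2 table: this word vs. other words, inside vs. outside the construction
tab <- matrix(c(word_in_cxn,
                cxn_total - word_in_cxn,
                word_total - word_in_cxn,
                corpus_size - cxn_total - word_total + word_in_cxn),
              nrow = 2)

# Fisher-Yates exact test of the word-construction association
p <- fisher.test(tab)$p.value

# Collostruction strength is conventionally reported as -log10(p)
coll_strength <- -log10(p)
```

A word whose observed frequency in the construction far exceeds its expected frequency under independence receives a very small p-value, and hence a large collostruction strength.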
8.2 Corpus
I will use the Apple News Corpus from Chapter 7 as our corpus.
In this demonstration, I would like to look at a particular morphosyntactic frame in Chinese, X + 起來. Our goal is simple: to find out the semantics of this constructional schema, it would be very informative to know which words tend to occur in its X slot.
So our first step is to load the corpus into R.
8.3 Word Segmentation
Because the Apple News Corpus is a raw-text corpus, we first need to word-segment it.
# Initialize the segmenter
segmenter <- worker(user="demo_data/dict-ch-user.txt", bylines = F, symbol = T)
word_seg_text <- function(x, tagger){
  x %>%
    segment(jiebar = tagger) %>%
    str_c(collapse = " ")
}
apple_df <- apple_corpus %>%
  tidy() %>%
  filter(text != "") %>%        # remove empty documents
  mutate(doc_id = row_number()) # create document index
apple_df <- apple_df %>%
  mutate(text_tag = map_chr(text, word_seg_text, segmenter))
8.4 Extract Constructions
With the word-segmented texts, we can now extract our target patterns from the corpus using regular expressions.
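The regular expression used here matches one or more non-space characters, a space, and then 起來. It can be sanity-checked on a toy word-segmented string before being applied to the whole corpus (the example sentence below is invented):

```r
library(stringr)

# A made-up, word-segmented sentence
toy <- "這 件 事 說 起來 容易 做 起來 難"

# One or more non-space characters, a space, then 起來
pattern_qilai <- "[^\\s]+\\s起來\\b"

str_extract_all(toy, pattern_qilai)[[1]]
# c("說 起來", "做 起來")
```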
# extract pattern
pattern_qilai = "[^\\s]+\\s起來\\b"
apple_df %>%
  select(-text) %>%
  unnest_tokens(output = construction,
                input = text_tag,
                token = function(x) str_extract_all(x, pattern = pattern_qilai)) -> apple_qilai
apple_qilai
8.5 Distributional Information Needed for CA
To perform the collostructional analysis, which is essentially a statistical analysis of the association between the words and the constructions, we need to collect necessary distributional information.
Also, to use Stefan Gries’ R script of Collostructional Analysis, we need the following information:
- Joint Frequencies of Words and Constructions
- Frequencies of Words in Corpus
- Corpus Size (total number of words in corpus)
- Construction Size (total number of constructions in corpus)
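Once a word-frequency table and a construction table are available, these four quantities fall out directly. A self-contained sketch with a tiny invented corpus (the real tables are built in the subsections below):

```r
library(dplyr)
library(tibble)

# Tiny invented word-frequency table and construction table
word_freq <- tibble(word = c("說", "起來", "看", "容易"),
                    n    = c(10, 7, 5, 3))
cxn <- tibble(construction = c("說 起來", "看 起來", "說 起來"))

corpus_size <- sum(word_freq$n)          # total number of words in corpus
cxn_size    <- nrow(cxn)                 # total number of construction tokens
joint_freq  <- count(cxn, construction)  # joint frequencies of word + construction
```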
8.5.1 Word Frequency List
# word freq
apple_df %>%
  select(-text) %>%
  unnest_tokens(word,
                text_tag,
                token = function(x) str_split(x, "\\s+|\u3000")) %>%
  filter(nzchar(word)) %>%
  count(word, sort = T) -> apple_word
apple_word
8.5.2 Construction Frequencies
apple_qilai %>%
  count(construction, sort = T) %>%
  tidyr::separate(col = "construction",
                  into = c("w1", "construction"),
                  sep = "\\s") %>%
  mutate(w1_freq = apple_word$n[match(w1, apple_word$word)]) -> apple_qilai_table
# prepare for coll analysis
apple_qilai_table %>%
  select(w1, w1_freq, n) %>%
  write_tsv("qilai.tsv")
Stefan Gries’ R script, used in a later step, expects its input as a tab-delimited file, which is why we save the table with write_tsv().
8.5.3 Other Information
We prepare necessary distributional information for the later collostructional analysis.
## Corpus Size: 3209617
## Construction Size: 546
8.5.4 Create Output File
This step creates an empty output .txt file to store the results of the Collostructional Analysis script.
## [1] TRUE
8.5.5 Run coll.analysis.r
Finally, we are now ready to perform the collostructional analysis using Stefan Gries’ coll.analysis.r.
This is an R script with interactive instructions: when you run the analysis, you will be prompted with guiding questions, to which you need to supply the necessary information.
Specifically, data to be entered include:
- analysis to perform: 1
- name of construction: QILAI
- corpus size: 3209617
- freq of constructions: 546
- index of association strength: 1 (= fisher-exact)
- sorting: 4 (= collostruction strength)
- decimals: 2
- text file with the raw data: <qilai.tsv>
- output file: <qilai_results.txt>
The output of coll.analysis.r is as shown below:

8.6 Chinese Four-character Idioms
Many studies have shown that Chinese makes use of a large proportion of four-character idioms in discourse. This chapter provides an exploratory analysis of four-character idioms in Chinese.
8.7 Dictionary Entries
In our demo_data directory, there is a file dict-ch-idiom.txt, which includes a list of four-character idioms in Chinese. These idioms were collected from 搜狗輸入法詞庫 (the Sogou input-method lexicon); the original files (.scel) have been combined, de-duplicated, and converted to a more machine-readable format, i.e., .txt.
Let’s first import the idioms in the file.
## [1] "阿保之功" "阿保之勞" "阿鼻地獄" "阿鼻叫喚" "阿斗太子" "阿芙蓉膏"
## [1] "罪無可逭" "罪人不帑" "作纛旗兒" "坐纛旂兒" "作姦犯科" "作育英才"
## [1] 56536
In order to make use of the tidy structure in R, we convert the data into a tibble:
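Assuming the idioms have been read into a character vector (here a small invented sample stands in for the full 56,536-entry list), the conversion is a one-liner:

```r
library(tibble)

# A small invented sample standing in for the full idiom list
idiom_chr <- c("阿保之功", "想來想去", "說來道去", "作育英才")

# One idiom per row, in a column named `string`
idiom <- tibble(string = idiom_chr)
```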
8.8 Case Study: X來Y去
We can create a regular expression pattern to extract all idioms with the format of X來Y去:
To analyze the meaning of this constructional schema, we may need to extract the X and Y in the schema:
idiom_laiqu <- idiom %>%
  filter(str_detect(string, ".來.去")) %>%
  mutate(pattern = str_replace(string, "(.)來(.)去", "\\1_\\2")) %>%
  separate(pattern, into = c("w1", "w2"), sep = "_")
idiom_laiqu
One empirical question is how many of these idioms follow the pattern X=Y (e.g., 想來想去, 直來直去) and how many follow X!=Y (e.g., 說來道去, 朝來暮去):
idiom_laiqu %>%
  mutate(structure = ifelse(w1 == w2, "XX", "XY")) %>%
  count(structure) %>%
  ggplot(aes(structure, n, fill = structure)) + geom_col()
8.9 Exercises
1. Use the tibble idiom and extract the idioms with the schema of 一X一Y. Also, provide the distribution of the two sub-types, X=Y vs. X!=Y.
2. Using idiom as our data source, if we are interested in all idioms that have duplicated characters in them, with schemas like either _A_A or A_A_, where A is a fixed character, how can we extract all idioms of these two types from idiom? Also, provide the distribution of the two types.


References
Stefanowitsch, Anatol, and Stefan Th. Gries. 2003. “Collostructions: Investigating the Interaction of Words and Constructions.” International Journal of Corpus Linguistics 8 (2): 209–43.